This tutorial walks through the data science pipeline using a dataset that contains the Top 200 Spotify charts in Europe for 2019, together with weather data for the different regions.
These days we are living through an exceptional situation due to the Covid-19 pandemic. Many countries are under lockdown as a measure to stop the spread of the virus. Music, as always, is used by many people to cope with the challenges posed by the situation, such as uncertainty, boredom or anxiety. Everyone consciously chooses music that complements their feelings at a specific moment, as music has the ability to evoke powerful emotional responses in listeners. This is true not only in times of crisis such as the one we are living through these days; it has been like this since the beginning of time.
Several studies have shown that music choices reflect people's emotional preferences, and many others have shown that weather has physiological effects. In this tutorial we will study how Spotify users' music choices varied with the time of the year and with other music features.
This tutorial will cover the main data science activities applied to the chosen dataset. These activities are organized and shown in the following workflow:
library(tidyverse)
## ── Attaching packages ───────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.2.1 ✓ purrr 0.3.3
## ✓ tibble 2.1.3 ✓ dplyr 0.8.3
## ✓ tidyr 1.0.2 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.4.0
## ── Conflicts ──────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(dplyr)
library(ggplot2)
library(RColorBrewer)
library(broom)
library(leaflet)
library(DataExplorer)
library(knitr)
After installing and importing the libraries that will be needed, the next step is to load the data. Nowadays there are many data repositories available on the internet. One great and commonly used source of datasets is Kaggle, which is where the dataset used in this tutorial was downloaded. The data can be found here.
A second dataset will also be used. It contains countries with their ISO 3166-1 Alpha-2 code, Alpha-3 code, UN M49 numeric code, and average latitude and longitude. It can be found here. We will use this dataset for interactive data visualization.
Once the datasets have been downloaded and saved in the same folder as this notebook, it is time to load these datasets into the environment.
It is very common to find datasets in comma-separated value (.csv) files, where each line contains attribute values separated by commas. As this format is so widely used, the tidyverse includes the readr package, which provides the read_csv function for reading a dataset stored in a csv file.
We assign the results of calling read_csv to variables called data and countries.
data <- read_csv("final_spotify.csv")
## Parsed with column specification:
## cols(
## .default = col_double(),
## region = col_character(),
## date = col_date(format = ""),
## spotify_id = col_character(),
## artist = col_character(),
## track_name = col_character()
## )
## See spec(...) for full column specifications.
countries <-read_csv("countries_codes_and_coordinates.csv")
## Parsed with column specification:
## cols(
## Country = col_character(),
## `Alpha-2 code` = col_character(),
## `Alpha-3 code` = col_character(),
## `Numeric code` = col_double(),
## `Latitude (average)` = col_double(),
## `Longitude (average)` = col_double()
## )
head(data)
## # A tibble: 6 x 26
## X1 region date month spotify_id artist track_name position streams
## <dbl> <chr> <date> <dbl> <chr> <chr> <chr> <dbl> <dbl>
## 1 0 AUT 2019-01-02 1 7KPGeiXWD… Capit… Benzema 1 35174
## 2 1 AUT 2019-01-03 1 7KPGeiXWD… Capit… Benzema 1 34237
## 3 2 AUT 2019-01-04 1 7KPGeiXWD… Capit… Benzema 1 35199
## 4 3 AUT 2019-01-05 1 7KPGeiXWD… Capit… Benzema 1 32559
## 5 4 AUT 2019-01-06 1 7KPGeiXWD… Capit… Benzema 1 26956
## 6 5 AUT 2019-01-07 1 7KPGeiXWD… Capit… Benzema 1 35018
## # … with 17 more variables: danceability <dbl>, energy <dbl>,
## # instrumentalness <dbl>, key <dbl>, liveness <dbl>, loudness <dbl>,
## # speechiness <dbl>, acousticness <dbl>, tempo <dbl>, valence <dbl>,
## # explicit <dbl>, temp <dbl>, rain <dbl>, snow <dbl>, cloud <dbl>,
## # humidity <dbl>, const <dbl>
head(countries)
## # A tibble: 6 x 6
## Country `Alpha-2 code` `Alpha-3 code` `Numeric code` `Latitude (aver…
## <chr> <chr> <chr> <dbl> <dbl>
## 1 Afghan… AF AFG 4 33
## 2 Albania AL ALB 8 41
## 3 Algeria DZ DZA 12 28
## 4 Americ… AS ASM 16 -14.3
## 5 Andorra AD AND 20 42.5
## 6 Angola AO AGO 24 -12.5
## # … with 1 more variable: `Longitude (average)` <dbl>
To be able to use a dataset for analysis it is necessary to prepare and organize it.
Now that the dataset has been imported, we can take a look at it and learn more about what it contains.
To learn the names of the attributes of this dataset, we can use the colnames command.
colnames(data)
## [1] "X1" "region" "date" "month"
## [5] "spotify_id" "artist" "track_name" "position"
## [9] "streams" "danceability" "energy" "instrumentalness"
## [13] "key" "liveness" "loudness" "speechiness"
## [17] "acousticness" "tempo" "valence" "explicit"
## [21] "temp" "rain" "snow" "cloud"
## [25] "humidity" "const"
colnames(countries)
## [1] "Country" "Alpha-2 code" "Alpha-3 code"
## [4] "Numeric code" "Latitude (average)" "Longitude (average)"
We can also learn more about the different attributes using the class (gives the type of an attribute), factor (converts an attribute to a categorical factor) or summary (summarizes an attribute, e.g. counts per level of a factor) commands. Generally the names of the attributes should be self-explanatory; however, sometimes they are not. For example, in this dataset "X1" or "const" do not say much about what they represent. Using the commands just mentioned, we will figure it out.
class(data$X1)
## [1] "numeric"
head(data$X1)
## [1] 0 1 2 3 4 5
summary(factor(data$const))
## 1
## 1425909
After getting some information about the attributes, we now know that "X1" contains just the index of each entity in the dataset and that "const" is 1 for every entity.
Now that we know more about our dataset, it is time to represent it in a form that is amenable to manipulation and statistical modeling.
We want our data to be based on rectangular (tidy) data structures where each row is an entity (observation), each column is an attribute (variable), and each cell holds a single value.
To obtain a tidy dataset, we will simply eliminate the attributes "const" and "X1", as they do not provide any information about the different entities. We also noticed that the "month" attribute provides information that can be extracted from "date", so it will be eliminated as well. Finally, "spotify_id" is not necessary for our analysis, as it does not provide any useful information, so we drop it too.
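As a quick sanity check that dropping "month" loses nothing, the month can always be recovered from "date" with format.Date. This is a sketch on toy dates, not the full Spotify file:

```r
# Toy example: the month is always derivable from the date,
# using the same format.Date call this tutorial uses later on.
dates <- as.Date(c("2019-01-02", "2019-07-15", "2019-12-31"))
format.Date(dates, "%m")
## [1] "01" "07" "12"
```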
The countries dataset will only be needed for interactive data visualization, so only the "Alpha-3 code", "Longitude (average)" and "Latitude (average)" attributes are selected.
data<-data%>%
select(-X1,-const, -month, -spotify_id)
head(data)
## # A tibble: 6 x 22
## region date artist track_name position streams danceability energy
## <chr> <date> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 AUT 2019-01-02 Capit… Benzema 1 35174 0.78 0.69
## 2 AUT 2019-01-03 Capit… Benzema 1 34237 0.78 0.69
## 3 AUT 2019-01-04 Capit… Benzema 1 35199 0.78 0.69
## 4 AUT 2019-01-05 Capit… Benzema 1 32559 0.78 0.69
## 5 AUT 2019-01-06 Capit… Benzema 1 26956 0.78 0.69
## 6 AUT 2019-01-07 Capit… Benzema 1 35018 0.78 0.69
## # … with 14 more variables: instrumentalness <dbl>, key <dbl>, liveness <dbl>,
## # loudness <dbl>, speechiness <dbl>, acousticness <dbl>, tempo <dbl>,
## # valence <dbl>, explicit <dbl>, temp <dbl>, rain <dbl>, snow <dbl>,
## # cloud <dbl>, humidity <dbl>
countries<-countries%>%
select(region=`Alpha-3 code`, lng=`Longitude (average)`, lat=`Latitude (average)`)
head(countries)
## # A tibble: 6 x 3
## region lng lat
## <chr> <dbl> <dbl>
## 1 AFG 65 33
## 2 ALB 20 41
## 3 DZA 3 28
## 4 ASM -170 -14.3
## 5 AND 1.6 42.5
## 6 AGO 18.5 -12.5
An interesting way of understanding the dataset is interactive data visualization. For example, with our datasets we can show the total number of streams of each country on a map. To keep the map clean, we can use colors and different marker radii. In this case, if a country has more than 9000 million streams, the radius is 12 and the marker color is red; if the country has between 1200 and 9000 million streams, the radius is 5 and the marker color is blue; otherwise the color is green and the radius is 2. In addition, a label with the total number of streams in millions is shown when the cursor hovers over a specific marker.
To build the interactive map we will use the leaflet library.
dataC<-left_join(data,countries, by="region")
data_reg<- dataC %>%
select(region,streams, lng, lat) %>%
group_by(region, lng, lat) %>%
summarise(Total_Streams = sum(streams)) %>%
arrange(desc(Total_Streams),region) %>%
mutate(millions = Total_Streams/1000000)
getColor <- function(data_reg) {
sapply(data_reg$millions, function(millions) {
if(millions > 9000) {
"red"
} else if(millions > 1200 & millions<=9000) {
"blue"
} else {
"green"
} })
}
getRadius<-function(data_reg) {
sapply(data_reg$millions, function(millions) {
if(millions > 9000) {
12
} else if(millions > 1200 & millions<=9000) {
5
} else {
2
} })
}
map2<-leaflet(data_reg) %>%
addTiles() %>%
setView(lat=53, lng=9, zoom=3)%>%
addCircleMarkers(~lng,~lat, color=getColor(data_reg),radius = getRadius(data_reg), label = data_reg$millions)
map2
With the data filtered and tidy, it is time to do some exploratory data analysis to have a better understanding of the different attributes.
First of all, we are going to show the monthly streaming trends. To do so, we group our entities by the month attribute and summarize the number of streams as the sum of all streams in each month.
m_s_trends<-data%>%
select(date,streams)%>%
group_by(month=format.Date(date,"%m"))%>%
summarise(totStreams=sum(streams)/1000000)
ggplot(m_s_trends, aes(x=month, y=totStreams,group=1))+ geom_line() + geom_point() + xlab("Month of the Year") + ylab("Total Songs Streamed in Millions") +
ggtitle("Spotify Monthly Streaming Trends for 2019")
From the graph we can conclude that the month with the most streamed songs is December.
Next, we show the monthly streaming trends by country (keeping only country-month pairs with more than 200 million streams). As before, we prepare the dataset by grouping entities by region and month.
m_s_trends<-data%>%
select(date,streams, region)%>%
group_by(region,month=format.Date(date,"%m"))%>%
summarise(totStreams=sum(streams)/1000000)%>%
filter(totStreams>200)
ggplot(m_s_trends, aes(x=month, y=totStreams,group=region, color=region))+ geom_line() + geom_point() + xlab("Month of the Year") + ylab("Total Songs Streamed in Millions") +
ggtitle("Spotify Monthly Streaming Trends for 2019 by Region")
However, from this new graph we can see that, in most countries, more songs are generally streamed during the summer months.
As we are more interested in the season of the year than in individual months, we add a column to the dataset containing the season.
We can then show the total number of streams per season.
seasons<-data%>%
mutate(Season=case_when(
date<as.Date("2019-03-19") | date>=as.Date("2019-12-22") ~ "Winter",
date<as.Date("2019-06-20") & date>=as.Date("2019-03-19") ~ "Spring",
date<as.Date("2019-09-22") & date>=as.Date("2019-06-20") ~ "Summer",
date<as.Date("2019-12-22") & date>=as.Date("2019-09-22") ~ "Fall"
))
seasons_streams<-seasons%>%
group_by(Season)%>%
summarise(streams=sum(streams)/1000000)
ggplot(seasons_streams, aes(x=Season, y=streams, fill=Season, label=streams)) +
geom_bar(width = 1, stat = "identity")+ggtitle("Streams per Season")
As we already concluded, more songs are played during summer.
From the dataset we can determine the 10 songs that spent the most days in the top 10 during 2019, as well as the top 10 artists based on the total number of streams of all of their songs contained in the dataset.
popular<-data%>%
filter(position<=10)%>%
mutate(month=format.Date(date,"%m"))%>%
select(track_name,month, position)%>%
group_by(track_name)%>%
summarise(totStreams=n())%>%
arrange(desc(totStreams))%>%
slice(1:10)
pie(popular$totStreams, popular$track_name, border="white", col=brewer.pal(10,"Set3"), main="The 10 Songs with the Most Days in the Top 10")
m_s_a_top10 <- data%>%
select(artist,streams)%>%
group_by(artist)%>%
summarise(Streams=sum(streams)/1000000)%>%
arrange(desc(Streams))%>%
slice(1:10)
m_s_a_top10$artist <- factor(m_s_a_top10$artist, levels=m_s_a_top10$artist[order(desc(m_s_a_top10$Streams))])
ggplot(m_s_a_top10, aes(x=artist, y=Streams))+ geom_bar(stat="identity", fill="navy")+theme(axis.text.x = element_text(face = "bold", angle = 45, hjust=1))+ ggtitle("Most Listened Artists")+xlab("Artist")+ ylab("Total number of streams")
From the two previous figures, we can conclude that the songs that stayed longest in the top 10 do not necessarily belong to the top 10 artists.
Now we are going to focus on the different features.
First of all we are going to compare the features of the top 10 songs.
m_s_t_top10 <- left_join(popular,data,by = "track_name")
m_s_t_top10_2 <- m_s_t_top10 %>% select(track_name,totStreams,danceability,energy,speechiness,acousticness,liveness,valence) %>% arrange(desc(totStreams))
ggplot(m_s_t_top10_2, aes(track_name)) +
geom_line(aes(y=danceability, color="danceability",group=1)) +
geom_line(aes(y=energy, color="energy",group=2)) +
geom_line(aes(y=speechiness, color="speechiness",group=3)) +
geom_line(aes(y=acousticness, color="acousticness",group=4)) +
geom_line(aes(y=liveness, color="liveness",group=5)) +
geom_line(aes(y=valence, color="valence",group=6)) +
xlab("Tracks") + ylab("Features") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))+
ggtitle("Feature Comparison of Top 10 Songs")
From the previous graph we can conclude that most of these songs have similar values for the different features; valence, for example, is the feature that varies the most.
As we said, we want to focus on the different seasons, so next we are going to show the mean value of each feature per month, colored by season.
ft<-seasons%>%
select(-position,-temp,-snow,-cloud,-humidity)%>%
group_by(Season,month=format.Date(date,"%m"))%>%
summarise(streams=sum(streams)/1000000, danceability=mean(danceability), energy=mean(energy),key=mean(key), liveness=mean(liveness), loudness=mean(loudness), speechiness=mean(speechiness), acousticness=mean(acousticness), tempo=mean(tempo), valence=mean(valence))
ggplot(ft, aes(y=danceability, x=month, color=Season))+ geom_point()+ggtitle("Danceability per month")+xlab("Month") + ylab("Danceability")
ggplot(ft, aes(y=energy, x=month, color=Season))+ geom_point()+ggtitle("Energy per month")+ xlab("Month") + ylab("Energy")
ggplot(ft, aes(y=key, x=month, color=Season))+ geom_point()+ggtitle("Key per month")+xlab("Month") + ylab("Key")
ggplot(ft, aes(y=liveness, x=month, color=Season))+ geom_point()+ggtitle("Liveness per month")+xlab("Month") + ylab("Liveness")
ggplot(ft, aes(y=speechiness, x=month, color=Season))+ geom_point()+ggtitle("Speechiness per month")+xlab("Month") + ylab("Speechiness")
ggplot(ft, aes(y=acousticness, x=month, color=Season))+ geom_point()+ggtitle("Acousticness per month")+xlab("Month") + ylab("Acousticness")
ggplot(ft, aes(y=loudness, x=month, color=Season))+ geom_point()+ggtitle("Loudness per month")+xlab("Month") + ylab("Loudness")
ggplot(ft, aes(y=tempo, x=month, color=Season))+ geom_point()+ggtitle("Tempo per month")+xlab("Month") + ylab("Tempo")
ggplot(ft, aes(y=valence, x=month, color=Season))+ geom_point() +ggtitle("Valence per month")+xlab("Month") + ylab("Valence")
Looking at the previous graphs, we can tell that, depending on the season, songs with certain features are more common. For example, at first sight, during summer users more commonly choose danceable and positive (high valence) music, while during winter the opposite occurs.
More specifically, based on what we observed in the previous figures, we are going to focus on "danceability", "energy" and "valence".
ggplot(seasons,mapping=aes(group=Season, y=danceability, x=Season, color=Season))+ geom_boxplot() + stat_summary(fun.y=mean, geom="point", color="red", fill="red", size=0.5) + ggtitle("Danceability per Season") + ylab("Danceability")
ggplot(seasons,mapping=aes(group=Season, y=energy, x=Season, color=Season))+ geom_boxplot() + stat_summary(fun.y=mean, geom="point", color="red", fill="red", size=0.5) + ggtitle("Energy per Season") + ylab("Energy")
ggplot(seasons,mapping=aes(group=Season, y=valence, x=Season, color=Season))+ geom_boxplot() + stat_summary(fun.y=mean, geom="point", color="red", fill="red", size=0.5) + ggtitle("Valence per Season") + ylab("Valence")
These three boxplots support the previous conclusion: the mean of the proposed features changes depending on the season, higher during summer and lower during winter.
ss<-seasons%>%
select(-position,-temp,-snow,-cloud,-humidity)%>%
group_by(Season)%>%
summarise(streams=sum(streams)/1000000, danceability=mean(danceability, na.rm = TRUE), energy=mean(energy, na.rm = TRUE), key=mean(key, na.rm = TRUE), liveness=mean(liveness, na.rm = TRUE), loudness=mean(loudness, na.rm = TRUE), speechiness=mean(speechiness, na.rm = TRUE), acousticness=mean(acousticness, na.rm = TRUE), tempo=mean(tempo, na.rm = TRUE), valence=mean(valence, na.rm = TRUE))
ggplot(ss,mapping=aes(x=Season, y=danceability, group=1)) + geom_line(color="red") + geom_point(size=0.1) + ylab("Danceability") + ggtitle("Mean Danceability per Season")
ggplot(ss,mapping=aes(x=Season, y=energy, group=1)) + geom_line(color="red") + geom_point(size=0.1) + ylab("Energy") + ggtitle("Mean Energy per Season")
ggplot(ss,mapping=aes(x=Season, y=valence, group=1)) + geom_line(color="red") + geom_point(size=0.1) + ylab("Valence") + ggtitle("Mean Valence per Season")
These plots show the central tendency of the data for the three features we are interested in: the value of each feature changes with the season. We will focus on the first two, as there appears to be a particular relationship in which the value increases from winter to summer.
Now that we have looked at the different attributes and we have completed our analysis and visualization, we are going to do linear regression and hypothesis testing.
From the previous analysis, we believe there is a relationship between certain features and the season of the year. To test this, we will fit linear models and check whether the null hypothesis of no relationship between these features and the season can be rejected.
ggplot(ss,mapping=aes(y=danceability, x=energy)) + geom_point(aes(color=Season)) + geom_smooth(method=lm) + ylab("Danceability") + xlab("Energy") + ggtitle("Relationship between Danceability, Energy and Season")
The scatter plot shows that energy and danceability are linearly related. It also shows that danceable songs are less common in winter and become more common as summer arrives.
First, we will create two different linear regression models: one between danceability and season, and the other between energy and season.
mod<-seasons%>%
select(-position,-temp,-snow,-cloud,-humidity)%>%
group_by(Season,month=format.Date(date,"%m"))%>%
summarise(danceability=mean(danceability, na.rm = TRUE), energy=mean(energy, na.rm = TRUE))
lm_d <-lm(danceability ~ Season, data=mod)
lm_d %>%
tidy() %>%
select(term, estimate, std.error)
## # A tibble: 4 x 3
## term estimate std.error
## <chr> <dbl> <dbl>
## 1 (Intercept) 0.685 0.00565
## 2 SeasonSpring 0.0103 0.00799
## 3 SeasonSummer 0.0168 0.00799
## 4 SeasonWinter 0.000972 0.00799
According to the model, danceability is essentially the same in winter as in fall (the intercept baseline), and increases through spring to its peak in summer, though not at a constant rate.
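To understand why the output lists terms such as SeasonSpring and SeasonSummer rather than a single Season coefficient, here is a small sketch (on a toy factor, not our dataset) of the treatment coding lm uses by default for categorical predictors: the alphabetically first level, Fall, is absorbed into the intercept as the baseline, and every other level gets a 0/1 dummy column whose coefficient estimates the difference from Fall.

```r
# Sketch of treatment (dummy) coding for a categorical predictor.
# The first factor level ("Fall") is the baseline; the other levels
# each get their own 0/1 indicator column.
season <- factor(c("Fall", "Spring", "Summer", "Winter"))
model.matrix(~ season)
```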
lm_e <-lm(energy ~ Season, data=mod)
lm_e %>%
tidy() %>%
select(term, estimate, std.error)
## # A tibble: 4 x 3
## term estimate std.error
## <chr> <dbl> <dbl>
## 1 (Intercept) 0.628 0.00759
## 2 SeasonSpring 0.0165 0.0107
## 3 SeasonSummer 0.0246 0.0107
## 4 SeasonWinter 0.00403 0.0107
According to this second model, energy is also lowest in fall (the baseline) and increases until summer, once again not at a constant rate.
We can also get some global statistics of our models.
lm_d %>%
glance() %>%
select(r.squared, sigma, statistic, df, p.value)
## # A tibble: 1 x 5
## r.squared sigma statistic df p.value
## <dbl> <dbl> <dbl> <int> <dbl>
## 1 0.335 0.0113 2.01 4 0.166
lm_e %>%
glance() %>%
select(r.squared, sigma, statistic, df, p.value)
## # A tibble: 1 x 5
## r.squared sigma statistic df p.value
## <dbl> <dbl> <dbl> <int> <dbl>
## 1 0.358 0.0152 2.23 4 0.138
To finish, we are going to show the distribution of the residuals of the models.
aug <- lm_d %>%
augment()
ggplot(aug,aes(x=.fitted,y=.resid)) +
geom_point() +
geom_smooth() +
labs(x="fitted", y="residual", title="Residuals vs Fitted. Danceability Model")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
aug <- lm_e %>%
augment()
ggplot(aug,aes(x=.fitted,y=.resid)) +
geom_point() +
geom_smooth() +
labs(x="fitted", y="residual", title="Residuals vs Fitted. Energy Model")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
As we can see above, the residuals in both models cluster around 0, so a linear model is a reasonable description of the data. However, given the p-values reported earlier (about 0.17 for danceability and 0.14 for energy), the seasonal effect is suggestive rather than statistically significant at the conventional 5% level, so we should be cautious about claiming a definitive relationship between danceability or energy and the season.
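Before drawing strong conclusions, note that the p-values reported by glance() above (about 0.17 and 0.14) are higher than the usual 0.05 threshold, so it is worth inspecting the uncertainty around each coefficient. A sketch of how this could be done with confint, shown here on a toy model so the snippet is self-contained; the same call applies directly to lm_d and lm_e.

```r
# Sketch: 95% confidence intervals for the coefficients of a linear
# model. Intervals that exclude 0 suggest a real effect; intervals
# that contain 0 mean the effect is not distinguishable from noise.
set.seed(42)                      # reproducible toy data
toy <- data.frame(x = 1:10)
toy$y <- 2 * toy$x + rnorm(10, sd = 0.5)
fit <- lm(y ~ x, data = toy)
confint(fit, level = 0.95)        # e.g. confint(lm_d) on our model
```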
As we anticipated at the beginning of this tutorial, music can help us connect with and understand our emotions, but it can also trigger new feelings, such as euphoria or anticipation of a specific event. The analysis carried out suggests that people tend to choose more energetic and danceable music as summer arrives, since for most people summer is synonymous with holidays and invites a good mood.